import numpy as np
import pandas as pd
import seaborn as sns
from statsmodels.formula.api import ols # For n-way ANOVA
from statsmodels.stats.anova import anova_lm # For n-way ANOVA
import matplotlib.pyplot as plt
%matplotlib inline
Salary is hypothesized to depend on educational qualification and occupation. To understand the dependency, the salaries of 40 individuals [SalaryData.csv] were collected, and each person's educational qualification and occupation were noted. Educational qualification has three levels: High school graduate, Bachelor, and Doctorate. Occupation has four levels: Administrative and clerical, Sales, Professional or specialty, and Executive or managerial. The number of observations differs across the education-occupation combinations.
[Assume that the data follows a normal distribution. In reality, the normality assumption may not always hold if the sample size is small.]
df = pd.read_csv('SalaryData.csv')
df.head()
|   | Education | Occupation | Salary |
|---|---|---|---|
| 0 | Doctorate | Adm-clerical | 153197 |
| 1 | Doctorate | Adm-clerical | 115945 |
| 2 | Doctorate | Adm-clerical | 175935 |
| 3 | Doctorate | Adm-clerical | 220754 |
| 4 | Doctorate | Sales | 170769 |
df.describe()
|   | Salary |
|---|---|
| count | 40.000000 |
| mean | 162186.875000 |
| std | 64860.407506 |
| min | 50103.000000 |
| 25% | 99897.500000 |
| 50% | 169100.000000 |
| 75% | 214440.750000 |
| max | 260151.000000 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Education   40 non-null     object
 1   Occupation  40 non-null     object
 2   Salary      40 non-null     int64
dtypes: int64(1), object(2)
memory usage: 1.1+ KB
Hypotheses for the one-way ANOVA of salary with respect to educational qualification:
H0: Salary does not depend on education (mean salary is equal across education levels). Ha: Salary depends on education (at least one education level has a different mean salary). Alpha = 0.05
Hypotheses for the one-way ANOVA of salary with respect to occupation:
H0: Salary does not depend on occupation (mean salary is equal across occupations). Ha: Salary depends on occupation (at least one occupation has a different mean salary). Alpha = 0.05
H0: Salary does not depend on education. Ha: Salary depends on education. Alpha = 0.05
formula = 'Salary~Education'
model = ols(formula, df).fit()
aov_table = anova_lm(model)
print(aov_table)
             df        sum_sq       mean_sq         F        PR(>F)
Education   2.0  1.026955e+11  5.134773e+10  30.95628  1.257709e-08
Residual   37.0  6.137256e+10  1.658718e+09       NaN           NaN
The p-value (1.26e-08) is less than 0.05, so we reject the null hypothesis and conclude that salary depends on education.
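Having rejected H0, a post-hoc test such as Tukey's HSD can show which pairs of education levels actually differ. Below is a minimal sketch on synthetic data (not SalaryData.csv; the group sizes and salary figures are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
# Three well-separated synthetic education groups, 10 observations each
demo = pd.DataFrame({
    'Education': ['HS-grad'] * 10 + ['Bachelors'] * 10 + ['Doctorate'] * 10,
    'Salary': np.concatenate([
        rng.normal(90000, 10000, 10),   # HS-grad
        rng.normal(150000, 10000, 10),  # Bachelors
        rng.normal(220000, 10000, 10),  # Doctorate
    ]),
})

# Pairwise comparisons with family-wise error controlled at alpha = 0.05
tukey = pairwise_tukeyhsd(endog=demo['Salary'], groups=demo['Education'], alpha=0.05)
print(tukey.summary())
```

For each pair of levels, `pairwise_tukeyhsd` reports the mean difference, an adjusted p-value, and whether the null hypothesis of equal means is rejected at the chosen alpha.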
H0: Salary does not depend on occupation. Ha: Salary depends on occupation. Alpha = 0.05
formula = 'Salary~Occupation'
model = ols(formula, df).fit()
aov_table1 = anova_lm(model)
print(aov_table1)
              df        sum_sq       mean_sq         F    PR(>F)
Occupation   3.0  1.125878e+10  3.752928e+09  0.884144  0.458508
Residual    36.0  1.528092e+11  4.244701e+09       NaN       NaN
The p-value (0.4585) is greater than 0.05, so we fail to reject the null hypothesis: there is no evidence that salary depends on occupation.
We rejected the null hypothesis for education, i.e., salary depends on education. Next, we fit a two-way ANOVA with both factors (without the interaction term).
formula = 'Salary ~ Education + Occupation'
model = ols(formula, df).fit()
aov_table = anova_lm(model)
(aov_table)
|   | df | sum_sq | mean_sq | F | PR(>F) |
|---|---|---|---|---|---|
| Education | 2.0 | 1.026955e+11 | 5.134773e+10 | 31.257677 | 1.981539e-08 |
| Occupation | 3.0 | 5.519946e+09 | 1.839982e+09 | 1.120080 | 3.545825e-01 |
| Residual | 34.0 | 5.585261e+10 | 1.642724e+09 | NaN | NaN |
The p-value for Education (1.98e-08) is smaller than the significance level α = 0.05, so we again reject the null hypothesis: mean salaries differ across education levels, i.e., at least one group mean is substantially different. Occupation remains non-significant (p ≈ 0.35).
sns.pointplot(x='Occupation', y='Salary', hue='Education', data=df, ci=None);
As per the interaction plot, administrative/clerical and sales employees with Bachelor and Doctorate degrees earn roughly similar salaries.
H0: All group mean salaries are equal (no interaction effect). Ha: At least one group mean is different.
formula = 'Salary ~ Education + Occupation + Education:Occupation'
model = ols(formula, df).fit()
aov_table = anova_lm(model)
(aov_table)
|   | df | sum_sq | mean_sq | F | PR(>F) |
|---|---|---|---|---|---|
| Education | 2.0 | 1.026955e+11 | 5.134773e+10 | 72.211958 | 5.466264e-12 |
| Occupation | 3.0 | 5.519946e+09 | 1.839982e+09 | 2.587626 | 7.211580e-02 |
| Education:Occupation | 6.0 | 3.634909e+10 | 6.058182e+09 | 8.519815 | 2.232500e-05 |
| Residual | 29.0 | 2.062102e+10 | 7.110697e+08 | NaN | NaN |
The p-values change once the interaction term is included: in the two-way ANOVA with interaction, the Education:Occupation term is significant (p ≈ 2.2e-05), so we reject the null hypothesis of no interaction. The effect of education on salary therefore depends on occupation.
Performing ANOVA on this case study identifies which factors affect salary. We conclude that salary depends strongly on education, and that the education effect varies by occupation (the interaction term is significant). Based on the interaction plot, we can suggest hiring HS-grads for administrative/clerical and sales roles, HS-grads and Bachelors for professional/specialty roles, and Bachelors and Doctorates for executive/managerial roles.
df1=pd.read_csv('Education_Post_12th_Standard.csv')
Basic Data Exploration

In this step, we will perform the following operations to check what the data set comprises:
- head of the dataset
- shape of the dataset
- info of the dataset
- summary of the dataset
df1.head()
|   | Names | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abilene Christian University | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | 3300 | 450 | 2200 | 70 | 78 | 18.1 | 12 | 7041 | 60 |
| 1 | Adelphi University | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | 6450 | 750 | 1500 | 29 | 30 | 12.2 | 16 | 10527 | 56 |
| 2 | Adrian College | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | 3750 | 400 | 1165 | 53 | 66 | 12.9 | 30 | 8735 | 54 |
| 3 | Agnes Scott College | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | 5450 | 450 | 875 | 92 | 97 | 7.7 | 37 | 19016 | 59 |
| 4 | Alaska Pacific University | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | 4120 | 800 | 1500 | 76 | 72 | 11.9 | 2 | 10922 | 15 |
df1.shape
(777, 18)
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 777 entries, 0 to 776
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Names        777 non-null    object
 1   Apps         777 non-null    int64
 2   Accept       777 non-null    int64
 3   Enroll       777 non-null    int64
 4   Top10perc    777 non-null    int64
 5   Top25perc    777 non-null    int64
 6   F.Undergrad  777 non-null    int64
 7   P.Undergrad  777 non-null    int64
 8   Outstate     777 non-null    int64
 9   Room.Board   777 non-null    int64
 10  Books        777 non-null    int64
 11  Personal     777 non-null    int64
 12  PhD          777 non-null    int64
 13  Terminal     777 non-null    int64
 14  S.F.Ratio    777 non-null    float64
 15  perc.alumni  777 non-null    int64
 16  Expend       777 non-null    int64
 17  Grad.Rate    777 non-null    int64
dtypes: float64(1), int64(16), object(1)
memory usage: 109.4+ KB
df1.describe()
|   | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 777.000000 | 777.000000 | 777.000000 | 777.000000 | 777.000000 | 777.000000 | 777.000000 | 777.000000 | 777.000000 | 777.000000 | 777.000000 | 777.000000 | 777.000000 | 777.000000 | 777.000000 | 777.000000 | 777.00000 |
| mean | 3001.638353 | 2018.804376 | 779.972973 | 27.558559 | 55.796654 | 3699.907336 | 855.298584 | 10440.669241 | 4357.526384 | 549.380952 | 1340.642214 | 72.660232 | 79.702703 | 14.089704 | 22.743887 | 9660.171171 | 65.46332 |
| std | 3870.201484 | 2451.113971 | 929.176190 | 17.640364 | 19.804778 | 4850.420531 | 1522.431887 | 4023.016484 | 1096.696416 | 165.105360 | 677.071454 | 16.328155 | 14.722359 | 3.958349 | 12.391801 | 5221.768440 | 17.17771 |
| min | 81.000000 | 72.000000 | 35.000000 | 1.000000 | 9.000000 | 139.000000 | 1.000000 | 2340.000000 | 1780.000000 | 96.000000 | 250.000000 | 8.000000 | 24.000000 | 2.500000 | 0.000000 | 3186.000000 | 10.00000 |
| 25% | 776.000000 | 604.000000 | 242.000000 | 15.000000 | 41.000000 | 992.000000 | 95.000000 | 7320.000000 | 3597.000000 | 470.000000 | 850.000000 | 62.000000 | 71.000000 | 11.500000 | 13.000000 | 6751.000000 | 53.00000 |
| 50% | 1558.000000 | 1110.000000 | 434.000000 | 23.000000 | 54.000000 | 1707.000000 | 353.000000 | 9990.000000 | 4200.000000 | 500.000000 | 1200.000000 | 75.000000 | 82.000000 | 13.600000 | 21.000000 | 8377.000000 | 65.00000 |
| 75% | 3624.000000 | 2424.000000 | 902.000000 | 35.000000 | 69.000000 | 4005.000000 | 967.000000 | 12925.000000 | 5050.000000 | 600.000000 | 1700.000000 | 85.000000 | 92.000000 | 16.500000 | 31.000000 | 10830.000000 | 78.00000 |
| max | 48094.000000 | 26330.000000 | 6392.000000 | 96.000000 | 100.000000 | 31643.000000 | 21836.000000 | 21700.000000 | 8124.000000 | 2340.000000 | 6800.000000 | 103.000000 | 100.000000 | 39.800000 | 64.000000 | 56233.000000 | 118.00000 |
dups = df1.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
df1[dups]
Number of duplicate rows = 0
*(empty DataFrame: no duplicate rows to display)*
There are no duplicate rows, so no data needs to be removed.
# Box plot of each numeric variable, one figure per column
for col in df1.select_dtypes(include='number').columns:
    df1.boxplot(column=[col])
    plt.show()
#Check for missing value
df1.isnull().sum()
Names          0
Apps           0
Accept         0
Enroll         0
Top10perc      0
Top25perc      0
F.Undergrad    0
P.Undergrad    0
Outstate       0
Room.Board     0
Books          0
Personal       0
PhD            0
Terminal       0
S.F.Ratio      0
perc.alumni    0
Expend         0
Grad.Rate      0
dtype: int64
#There is no missing value
plt.figure(figsize=(15, 20))
# Histogram of each of the 17 numeric variables in a 6x3 grid
for i, col in enumerate(df1.select_dtypes(include='number').columns, start=1):
    plt.subplot(6, 3, i)
    sns.histplot(x=df1[col])
plt.show()
From the figures above, the following variables are right-skewed: Apps, Accept, Enroll, Top10perc, F.Undergrad, P.Undergrad, Personal, perc.alumni and Expend. PhD and Terminal are left-skewed. The remaining variables are approximately normally distributed.
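The visual assessment can be cross-checked numerically with Fisher's skewness coefficient (positive means right-skewed, negative means left-skewed); on the actual data this would be `df1.skew(numeric_only=True)`. A small illustration on synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Exponential draws are right-skewed; their negation is left-skewed
demo = pd.DataFrame({
    'right_skewed': rng.exponential(scale=2.0, size=1000),
    'left_skewed': -rng.exponential(scale=2.0, size=1000),
})

# Sample skewness per column: positive for the first, negative for the second
print(demo.skew())
```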
sns.pairplot(df1)
plt.show()
df1.corr(numeric_only=True)  # exclude the non-numeric 'Names' column
|   | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Apps | 1.000000 | 0.943451 | 0.846822 | 0.338834 | 0.351640 | 0.814491 | 0.398264 | 0.050159 | 0.164939 | 0.132559 | 0.178731 | 0.390697 | 0.369491 | 0.095633 | -0.090226 | 0.259592 | 0.146755 |
| Accept | 0.943451 | 1.000000 | 0.911637 | 0.192447 | 0.247476 | 0.874223 | 0.441271 | -0.025755 | 0.090899 | 0.113525 | 0.200989 | 0.355758 | 0.337583 | 0.176229 | -0.159990 | 0.124717 | 0.067313 |
| Enroll | 0.846822 | 0.911637 | 1.000000 | 0.181294 | 0.226745 | 0.964640 | 0.513069 | -0.155477 | -0.040232 | 0.112711 | 0.280929 | 0.331469 | 0.308274 | 0.237271 | -0.180794 | 0.064169 | -0.022341 |
| Top10perc | 0.338834 | 0.192447 | 0.181294 | 1.000000 | 0.891995 | 0.141289 | -0.105356 | 0.562331 | 0.371480 | 0.118858 | -0.093316 | 0.531828 | 0.491135 | -0.384875 | 0.455485 | 0.660913 | 0.494989 |
| Top25perc | 0.351640 | 0.247476 | 0.226745 | 0.891995 | 1.000000 | 0.199445 | -0.053577 | 0.489394 | 0.331490 | 0.115527 | -0.080810 | 0.545862 | 0.524749 | -0.294629 | 0.417864 | 0.527447 | 0.477281 |
| F.Undergrad | 0.814491 | 0.874223 | 0.964640 | 0.141289 | 0.199445 | 1.000000 | 0.570512 | -0.215742 | -0.068890 | 0.115550 | 0.317200 | 0.318337 | 0.300019 | 0.279703 | -0.229462 | 0.018652 | -0.078773 |
| P.Undergrad | 0.398264 | 0.441271 | 0.513069 | -0.105356 | -0.053577 | 0.570512 | 1.000000 | -0.253512 | -0.061326 | 0.081200 | 0.319882 | 0.149114 | 0.141904 | 0.232531 | -0.280792 | -0.083568 | -0.257001 |
| Outstate | 0.050159 | -0.025755 | -0.155477 | 0.562331 | 0.489394 | -0.215742 | -0.253512 | 1.000000 | 0.654256 | 0.038855 | -0.299087 | 0.382982 | 0.407983 | -0.554821 | 0.566262 | 0.672779 | 0.571290 |
| Room.Board | 0.164939 | 0.090899 | -0.040232 | 0.371480 | 0.331490 | -0.068890 | -0.061326 | 0.654256 | 1.000000 | 0.127963 | -0.199428 | 0.329202 | 0.374540 | -0.362628 | 0.272363 | 0.501739 | 0.424942 |
| Books | 0.132559 | 0.113525 | 0.112711 | 0.118858 | 0.115527 | 0.115550 | 0.081200 | 0.038855 | 0.127963 | 1.000000 | 0.179295 | 0.026906 | 0.099955 | -0.031929 | -0.040208 | 0.112409 | 0.001061 |
| Personal | 0.178731 | 0.200989 | 0.280929 | -0.093316 | -0.080810 | 0.317200 | 0.319882 | -0.299087 | -0.199428 | 0.179295 | 1.000000 | -0.010936 | -0.030613 | 0.136345 | -0.285968 | -0.097892 | -0.269344 |
| PhD | 0.390697 | 0.355758 | 0.331469 | 0.531828 | 0.545862 | 0.318337 | 0.149114 | 0.382982 | 0.329202 | 0.026906 | -0.010936 | 1.000000 | 0.849587 | -0.130530 | 0.249009 | 0.432762 | 0.305038 |
| Terminal | 0.369491 | 0.337583 | 0.308274 | 0.491135 | 0.524749 | 0.300019 | 0.141904 | 0.407983 | 0.374540 | 0.099955 | -0.030613 | 0.849587 | 1.000000 | -0.160104 | 0.267130 | 0.438799 | 0.289527 |
| S.F.Ratio | 0.095633 | 0.176229 | 0.237271 | -0.384875 | -0.294629 | 0.279703 | 0.232531 | -0.554821 | -0.362628 | -0.031929 | 0.136345 | -0.130530 | -0.160104 | 1.000000 | -0.402929 | -0.583832 | -0.306710 |
| perc.alumni | -0.090226 | -0.159990 | -0.180794 | 0.455485 | 0.417864 | -0.229462 | -0.280792 | 0.566262 | 0.272363 | -0.040208 | -0.285968 | 0.249009 | 0.267130 | -0.402929 | 1.000000 | 0.417712 | 0.490898 |
| Expend | 0.259592 | 0.124717 | 0.064169 | 0.660913 | 0.527447 | 0.018652 | -0.083568 | 0.672779 | 0.501739 | 0.112409 | -0.097892 | 0.432762 | 0.438799 | -0.583832 | 0.417712 | 1.000000 | 0.390343 |
| Grad.Rate | 0.146755 | 0.067313 | -0.022341 | 0.494989 | 0.477281 | -0.078773 | -0.257001 | 0.571290 | 0.424942 | 0.001061 | -0.269344 | 0.305038 | 0.289527 | -0.306710 | 0.490898 | 0.390343 | 1.000000 |
plt.figure(figsize=(12,7))
sns.heatmap(df1.corr(numeric_only=True), annot=True, fmt='.2f', cmap='Purples')
plt.show()
Observation:
A considerable number of features are highly correlated (e.g., Apps and Accept at 0.94, Enroll and F.Undergrad at 0.96, PhD and Terminal at 0.85).
We need to scale the data for this case study: variables such as Apps, Accept and F.Undergrad take values in the hundreds and thousands, while Top10perc, Top25perc and PhD are two-digit percentages. Because the variables are on different scales, they are hard to compare directly, and scale-sensitive techniques such as PCA would be dominated by the large-valued variables.
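As a quick illustration of why scaling matters for the PCA applied later: without standardization, the first principal component is dominated by whichever feature has the largest raw variance. A sketch on synthetic data whose two scales loosely mimic Apps-like and Top10perc-like columns (the numbers are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
a = rng.normal(3000, 4000, 500)                       # thousands-scale feature
b = 28 + 0.002 * (a - 3000) + rng.normal(0, 10, 500)  # correlated two-digit feature
X = np.column_stack([a, b])

# PC1 loadings on raw vs standardized data
raw_pc1 = PCA(n_components=1).fit(X).components_[0]
scaled_pc1 = PCA(n_components=1).fit(StandardScaler().fit_transform(X)).components_[0]

print(np.abs(raw_pc1))     # loading falls almost entirely on the large-scale feature
print(np.abs(scaled_pc1))  # after scaling, both features contribute comparably
```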
from sklearn.preprocessing import StandardScaler
std_scale = StandardScaler()
std_scale
StandardScaler()
df2=df1.copy()
df2.head()
|   | Names | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abilene Christian University | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | 3300 | 450 | 2200 | 70 | 78 | 18.1 | 12 | 7041 | 60 |
| 1 | Adelphi University | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | 6450 | 750 | 1500 | 29 | 30 | 12.2 | 16 | 10527 | 56 |
| 2 | Adrian College | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | 3750 | 400 | 1165 | 53 | 66 | 12.9 | 30 | 8735 | 54 |
| 3 | Agnes Scott College | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | 5450 | 450 | 875 | 92 | 97 | 7.7 | 37 | 19016 | 59 |
| 4 | Alaska Pacific University | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | 4120 | 800 | 1500 | 76 | 72 | 11.9 | 2 | 10922 | 15 |
# Standardize every numeric column in one call (StandardScaler scales each
# feature independently, so this is equivalent to scaling column by column)
num_cols = df2.select_dtypes(include='number').columns
df2[num_cols] = std_scale.fit_transform(df2[num_cols])
df2.head()
|   | Names | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abilene Christian University | -0.346882 | -0.321205 | -0.063509 | -0.258583 | -0.191827 | -0.168116 | -0.209207 | -0.746356 | -0.964905 | -0.602312 | 1.270045 | -0.163028 | -0.115729 | 1.013776 | -0.867574 | -0.501910 | -0.318252 |
| 1 | Adelphi University | -0.210884 | -0.038703 | -0.288584 | -0.655656 | -1.353911 | -0.209788 | 0.244307 | 0.457496 | 1.909208 | 1.215880 | 0.235515 | -2.675646 | -3.378176 | -0.477704 | -0.544572 | 0.166110 | -0.551262 |
| 2 | Adrian College | -0.406866 | -0.376318 | -0.478121 | -0.315307 | -0.292878 | -0.549565 | -0.497090 | 0.201305 | -0.554317 | -0.905344 | -0.259582 | -1.204845 | -0.931341 | -0.300749 | 0.585935 | -0.177290 | -0.667767 |
| 3 | Agnes Scott College | -0.668261 | -0.681682 | -0.692427 | 1.840231 | 1.677612 | -0.658079 | -0.520752 | 0.626633 | 0.996791 | -0.602312 | -0.688173 | 1.185206 | 1.175657 | -1.615274 | 1.151188 | 1.792851 | -0.376504 |
| 4 | Alaska Pacific University | -0.726176 | -0.764555 | -0.780735 | -0.655656 | -0.596031 | -0.711924 | 0.009005 | -0.716508 | -0.216723 | 1.518912 | 0.235515 | 0.204672 | -0.523535 | -0.553542 | -1.675079 | 0.241803 | -2.939613 |
We can see that all variables are now standardized to a common scale (mean 0, standard deviation 1).
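A quick sanity check of the standardization, shown here on a small synthetic matrix (on the real data one would pass df2's numeric columns): each scaled column should have mean ≈ 0 and standard deviation ≈ 1, since StandardScaler uses the population standard deviation (ddof=0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small illustrative matrix (two columns on very different scales)
X = np.array([[1660., 7440.],
              [2186., 12280.],
              [1428., 11250.],
              [417., 12960.]])
Xs = StandardScaler().fit_transform(X)

print(Xs.mean(axis=0))  # each column mean is ~0
print(Xs.std(axis=0))   # each column std is ~1
```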
df2.cov(numeric_only=True)
|   | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Apps | 1.001289 | 0.944666 | 0.847913 | 0.339270 | 0.352093 | 0.815540 | 0.398777 | 0.050224 | 0.165152 | 0.132729 | 0.178961 | 0.391201 | 0.369968 | 0.095756 | -0.090342 | 0.259927 | 0.146944 |
| Accept | 0.944666 | 1.001289 | 0.912811 | 0.192695 | 0.247795 | 0.875350 | 0.441839 | -0.025788 | 0.091016 | 0.113672 | 0.201248 | 0.356216 | 0.338018 | 0.176456 | -0.160196 | 0.124878 | 0.067399 |
| Enroll | 0.847913 | 0.912811 | 1.001289 | 0.181527 | 0.227037 | 0.965883 | 0.513730 | -0.155678 | -0.040284 | 0.112856 | 0.281291 | 0.331896 | 0.308671 | 0.237577 | -0.181027 | 0.064252 | -0.022370 |
| Top10perc | 0.339270 | 0.192695 | 0.181527 | 1.001289 | 0.893144 | 0.141471 | -0.105492 | 0.563055 | 0.371959 | 0.119012 | -0.093437 | 0.532513 | 0.491768 | -0.385370 | 0.456072 | 0.661765 | 0.495627 |
| Top25perc | 0.352093 | 0.247795 | 0.227037 | 0.893144 | 1.001289 | 0.199702 | -0.053646 | 0.490024 | 0.331917 | 0.115676 | -0.080914 | 0.546566 | 0.525425 | -0.295009 | 0.418403 | 0.528127 | 0.477896 |
| F.Undergrad | 0.815540 | 0.875350 | 0.965883 | 0.141471 | 0.199702 | 1.001289 | 0.571247 | -0.216020 | -0.068979 | 0.115699 | 0.317608 | 0.318747 | 0.300406 | 0.280064 | -0.229758 | 0.018676 | -0.078875 |
| P.Undergrad | 0.398777 | 0.441839 | 0.513730 | -0.105492 | -0.053646 | 0.571247 | 1.001289 | -0.253839 | -0.061405 | 0.081304 | 0.320294 | 0.149306 | 0.142086 | 0.232830 | -0.281154 | -0.083676 | -0.257332 |
| Outstate | 0.050224 | -0.025788 | -0.155678 | 0.563055 | 0.490024 | -0.216020 | -0.253839 | 1.001289 | 0.655100 | 0.038905 | -0.299472 | 0.383476 | 0.408509 | -0.555536 | 0.566992 | 0.673646 | 0.572026 |
| Room.Board | 0.165152 | 0.091016 | -0.040284 | 0.371959 | 0.331917 | -0.068979 | -0.061405 | 0.655100 | 1.001289 | 0.128128 | -0.199685 | 0.329627 | 0.375022 | -0.363095 | 0.272714 | 0.502386 | 0.425489 |
| Books | 0.132729 | 0.113672 | 0.112856 | 0.119012 | 0.115676 | 0.115699 | 0.081304 | 0.038905 | 0.128128 | 1.001289 | 0.179526 | 0.026940 | 0.100084 | -0.031970 | -0.040260 | 0.112554 | 0.001062 |
| Personal | 0.178961 | 0.201248 | 0.281291 | -0.093437 | -0.080914 | 0.317608 | 0.320294 | -0.299472 | -0.199685 | 0.179526 | 1.001289 | -0.010950 | -0.030653 | 0.136521 | -0.286337 | -0.098018 | -0.269691 |
| PhD | 0.391201 | 0.356216 | 0.331896 | 0.532513 | 0.546566 | 0.318747 | 0.149306 | 0.383476 | 0.329627 | 0.026940 | -0.010950 | 1.001289 | 0.850682 | -0.130698 | 0.249330 | 0.433319 | 0.305431 |
| Terminal | 0.369968 | 0.338018 | 0.308671 | 0.491768 | 0.525425 | 0.300406 | 0.142086 | 0.408509 | 0.375022 | 0.100084 | -0.030653 | 0.850682 | 1.001289 | -0.160310 | 0.267475 | 0.439365 | 0.289900 |
| S.F.Ratio | 0.095756 | 0.176456 | 0.237577 | -0.385370 | -0.295009 | 0.280064 | 0.232830 | -0.555536 | -0.363095 | -0.031970 | 0.136521 | -0.130698 | -0.160310 | 1.001289 | -0.403448 | -0.584584 | -0.307106 |
| perc.alumni | -0.090342 | -0.160196 | -0.181027 | 0.456072 | 0.418403 | -0.229758 | -0.281154 | 0.566992 | 0.272714 | -0.040260 | -0.286337 | 0.249330 | 0.267475 | -0.403448 | 1.001289 | 0.418250 | 0.491530 |
| Expend | 0.259927 | 0.124878 | 0.064252 | 0.661765 | 0.528127 | 0.018676 | -0.083676 | 0.673646 | 0.502386 | 0.112554 | -0.098018 | 0.433319 | 0.439365 | -0.584584 | 0.418250 | 1.001289 | 0.390846 |
| Grad.Rate | 0.146944 | 0.067399 | -0.022370 | 0.495627 | 0.477896 | -0.078875 | -0.257332 | 0.572026 | 0.425489 | 0.001062 | -0.269691 | 0.305431 | 0.289900 | -0.307106 | 0.491530 | 0.390846 | 1.001289 |
df2.corr(numeric_only=True)
|   | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Apps | 1.000000 | 0.943451 | 0.846822 | 0.338834 | 0.351640 | 0.814491 | 0.398264 | 0.050159 | 0.164939 | 0.132559 | 0.178731 | 0.390697 | 0.369491 | 0.095633 | -0.090226 | 0.259592 | 0.146755 |
| Accept | 0.943451 | 1.000000 | 0.911637 | 0.192447 | 0.247476 | 0.874223 | 0.441271 | -0.025755 | 0.090899 | 0.113525 | 0.200989 | 0.355758 | 0.337583 | 0.176229 | -0.159990 | 0.124717 | 0.067313 |
| Enroll | 0.846822 | 0.911637 | 1.000000 | 0.181294 | 0.226745 | 0.964640 | 0.513069 | -0.155477 | -0.040232 | 0.112711 | 0.280929 | 0.331469 | 0.308274 | 0.237271 | -0.180794 | 0.064169 | -0.022341 |
| Top10perc | 0.338834 | 0.192447 | 0.181294 | 1.000000 | 0.891995 | 0.141289 | -0.105356 | 0.562331 | 0.371480 | 0.118858 | -0.093316 | 0.531828 | 0.491135 | -0.384875 | 0.455485 | 0.660913 | 0.494989 |
| Top25perc | 0.351640 | 0.247476 | 0.226745 | 0.891995 | 1.000000 | 0.199445 | -0.053577 | 0.489394 | 0.331490 | 0.115527 | -0.080810 | 0.545862 | 0.524749 | -0.294629 | 0.417864 | 0.527447 | 0.477281 |
| F.Undergrad | 0.814491 | 0.874223 | 0.964640 | 0.141289 | 0.199445 | 1.000000 | 0.570512 | -0.215742 | -0.068890 | 0.115550 | 0.317200 | 0.318337 | 0.300019 | 0.279703 | -0.229462 | 0.018652 | -0.078773 |
| P.Undergrad | 0.398264 | 0.441271 | 0.513069 | -0.105356 | -0.053577 | 0.570512 | 1.000000 | -0.253512 | -0.061326 | 0.081200 | 0.319882 | 0.149114 | 0.141904 | 0.232531 | -0.280792 | -0.083568 | -0.257001 |
| Outstate | 0.050159 | -0.025755 | -0.155477 | 0.562331 | 0.489394 | -0.215742 | -0.253512 | 1.000000 | 0.654256 | 0.038855 | -0.299087 | 0.382982 | 0.407983 | -0.554821 | 0.566262 | 0.672779 | 0.571290 |
| Room.Board | 0.164939 | 0.090899 | -0.040232 | 0.371480 | 0.331490 | -0.068890 | -0.061326 | 0.654256 | 1.000000 | 0.127963 | -0.199428 | 0.329202 | 0.374540 | -0.362628 | 0.272363 | 0.501739 | 0.424942 |
| Books | 0.132559 | 0.113525 | 0.112711 | 0.118858 | 0.115527 | 0.115550 | 0.081200 | 0.038855 | 0.127963 | 1.000000 | 0.179295 | 0.026906 | 0.099955 | -0.031929 | -0.040208 | 0.112409 | 0.001061 |
| Personal | 0.178731 | 0.200989 | 0.280929 | -0.093316 | -0.080810 | 0.317200 | 0.319882 | -0.299087 | -0.199428 | 0.179295 | 1.000000 | -0.010936 | -0.030613 | 0.136345 | -0.285968 | -0.097892 | -0.269344 |
| PhD | 0.390697 | 0.355758 | 0.331469 | 0.531828 | 0.545862 | 0.318337 | 0.149114 | 0.382982 | 0.329202 | 0.026906 | -0.010936 | 1.000000 | 0.849587 | -0.130530 | 0.249009 | 0.432762 | 0.305038 |
| Terminal | 0.369491 | 0.337583 | 0.308274 | 0.491135 | 0.524749 | 0.300019 | 0.141904 | 0.407983 | 0.374540 | 0.099955 | -0.030613 | 0.849587 | 1.000000 | -0.160104 | 0.267130 | 0.438799 | 0.289527 |
| S.F.Ratio | 0.095633 | 0.176229 | 0.237271 | -0.384875 | -0.294629 | 0.279703 | 0.232531 | -0.554821 | -0.362628 | -0.031929 | 0.136345 | -0.130530 | -0.160104 | 1.000000 | -0.402929 | -0.583832 | -0.306710 |
| perc.alumni | -0.090226 | -0.159990 | -0.180794 | 0.455485 | 0.417864 | -0.229462 | -0.280792 | 0.566262 | 0.272363 | -0.040208 | -0.285968 | 0.249009 | 0.267130 | -0.402929 | 1.000000 | 0.417712 | 0.490898 |
| Expend | 0.259592 | 0.124717 | 0.064169 | 0.660913 | 0.527447 | 0.018652 | -0.083568 | 0.672779 | 0.501739 | 0.112409 | -0.097892 | 0.432762 | 0.438799 | -0.583832 | 0.417712 | 1.000000 | 0.390343 |
| Grad.Rate | 0.146755 | 0.067313 | -0.022341 | 0.494989 | 0.477281 | -0.078773 | -0.257001 | 0.571290 | 0.424942 | 0.001061 | -0.269344 | 0.305038 | 0.289527 | -0.306710 | 0.490898 | 0.390343 | 1.000000 |
plt.figure(figsize=(12,7))
sns.heatmap(df2.corr(numeric_only=True), annot=True, fmt='.2f', cmap='Blues')
plt.show()
After scaling, the covariance matrix and the correlation matrix are almost identical: standardized variables have unit (population) variance, so their covariances equal their correlations up to the n/(n-1) factor that pandas' .cov() (ddof=1) introduces. With n = 777 that factor is 777/776 ≈ 1.001289, exactly the value on the diagonal of df2.cov().
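This can be verified numerically on a small synthetic frame: standardize with the population std (as StandardScaler does) and the covariance matrix rescaled by (n-1)/n matches the correlation matrix exactly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])

# Standardize the way StandardScaler does (population std, ddof=0)
z = (demo - demo.mean()) / demo.std(ddof=0)
n = len(z)

# pandas .cov() divides by n-1, so covariance of standardized data equals
# the correlation times n/(n-1); undoing that factor recovers the correlation
print(np.allclose(z.cov() * (n - 1) / n, z.corr()))  # True
```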
#Check for presence of outliers in unscaled data
plt.figure(figsize=(15,8))
sns.boxplot(data=df1)
plt.show()
#Check for presence of outliers in scaled data
plt.figure(figsize=(15,8))
sns.boxplot(data=df2)
plt.show()
The box plots show that, after scaling, all variables lie on a single comparable scale and the outliers fall within a narrower range than in the unscaled data.
df_pca=df2.drop(['Names'], axis = 1)
len(df_pca.columns)
17
#Apply PCA taking all features
from sklearn.decomposition import PCA
pca1 = PCA(n_components=17, random_state=123)
pca1_transformed = pca1.fit_transform(df_pca)
#Extract the eigen vectors
pca1.components_
array([[ 2.48765602e-01, 2.07601502e-01, 1.76303592e-01,
3.54273947e-01, 3.44001279e-01, 1.54640962e-01,
2.64425045e-02, 2.94736419e-01, 2.49030449e-01,
6.47575181e-02, -4.25285386e-02, 3.18312875e-01,
3.17056016e-01, -1.76957895e-01, 2.05082369e-01,
3.18908750e-01, 2.52315654e-01],
[ 3.31598227e-01, 3.72116750e-01, 4.03724252e-01,
-8.24118211e-02, -4.47786551e-02, 4.17673774e-01,
3.15087830e-01, -2.49643522e-01, -1.37808883e-01,
5.63418434e-02, 2.19929218e-01, 5.83113174e-02,
4.64294477e-02, 2.46665277e-01, -2.46595274e-01,
-1.31689865e-01, -1.69240532e-01],
[-6.30921033e-02, -1.01249056e-01, -8.29855709e-02,
3.50555339e-02, -2.41479376e-02, -6.13929764e-02,
1.39681716e-01, 4.65988731e-02, 1.48967389e-01,
6.77411649e-01, 4.99721120e-01, -1.27028371e-01,
-6.60375454e-02, -2.89848401e-01, -1.46989274e-01,
2.26743985e-01, -2.08064649e-01],
[ 2.81310530e-01, 2.67817346e-01, 1.61826771e-01,
-5.15472524e-02, -1.09766541e-01, 1.00412335e-01,
-1.58558487e-01, 1.31291364e-01, 1.84995991e-01,
8.70892205e-02, -2.30710568e-01, -5.34724832e-01,
-5.19443019e-01, -1.61189487e-01, 1.73142230e-02,
7.92734946e-02, 2.69129066e-01],
[ 5.74140964e-03, 5.57860920e-02, -5.56936353e-02,
-3.95434345e-01, -4.26533594e-01, -4.34543659e-02,
3.02385408e-01, 2.22532003e-01, 5.60919470e-01,
-1.27288825e-01, -2.22311021e-01, 1.40166326e-01,
2.04719730e-01, -7.93882496e-02, -2.16297411e-01,
7.59581203e-02, -1.09267913e-01],
[-1.62374420e-02, 7.53468452e-03, -4.25579803e-02,
-5.26927980e-02, 3.30915896e-02, -4.34542349e-02,
-1.91198583e-01, -3.00003910e-02, 1.62755446e-01,
6.41054950e-01, -3.31398003e-01, 9.12555212e-02,
1.54927646e-01, 4.87045875e-01, -4.73400144e-02,
-2.98118619e-01, 2.16163313e-01],
[-4.24863486e-02, -1.29497196e-02, -2.76928937e-02,
-1.61332069e-01, -1.18485556e-01, -2.50763629e-02,
6.10423460e-02, 1.08528966e-01, 2.09744235e-01,
-1.49692034e-01, 6.33790064e-01, -1.09641298e-03,
-2.84770105e-02, 2.19259358e-01, 2.43321156e-01,
-2.26584481e-01, 5.59943937e-01],
[-1.03090398e-01, -5.62709623e-02, 5.86623552e-02,
-1.22678028e-01, -1.02491967e-01, 7.88896442e-02,
5.70783816e-01, 9.84599754e-03, -2.21453442e-01,
2.13293009e-01, -2.32660840e-01, -7.70400002e-02,
-1.21613297e-02, -8.36048735e-02, 6.78523654e-01,
-5.41593771e-02, -5.33553891e-03],
[-9.02270802e-02, -1.77864814e-01, -1.28560713e-01,
3.41099863e-01, 4.03711989e-01, -5.94419181e-02,
5.60672902e-01, -4.57332880e-03, 2.75022548e-01,
-1.33663353e-01, -9.44688900e-02, -1.85181525e-01,
-2.54938198e-01, 2.74544380e-01, -2.55334907e-01,
-4.91388809e-02, 4.19043052e-02],
[ 5.25098025e-02, 4.11400844e-02, 3.44879147e-02,
6.40257785e-02, 1.45492289e-02, 2.08471834e-02,
-2.23105808e-01, 1.86675363e-01, 2.98324237e-01,
-8.20292186e-02, 1.36027616e-01, -1.23452200e-01,
-8.85784627e-02, 4.72045249e-01, 4.22999706e-01,
1.32286331e-01, -5.90271067e-01],
[ 4.30462074e-02, -5.84055850e-02, -6.93988831e-02,
-8.10481404e-03, -2.73128469e-01, -8.11578181e-02,
1.00693324e-01, 1.43220673e-01, -3.59321731e-01,
3.19400370e-02, -1.85784733e-02, 4.03723253e-02,
-5.89734026e-02, 4.45000727e-01, -1.30727978e-01,
6.92088870e-01, 2.19839000e-01],
[ 2.40709086e-02, -1.45102446e-01, 1.11431545e-02,
3.85543001e-02, -8.93515563e-02, 5.61767721e-02,
-6.35360730e-02, -8.23443779e-01, 3.54559731e-01,
-2.81593679e-02, -3.92640266e-02, 2.32224316e-02,
1.64850420e-02, -1.10262122e-02, 1.82660654e-01,
3.25982295e-01, 1.22106697e-01],
[ 5.95830975e-01, 2.92642398e-01, -4.44638207e-01,
1.02303616e-03, 2.18838802e-02, -5.23622267e-01,
1.25997650e-01, -1.41856014e-01, -6.97485854e-02,
1.14379958e-02, 3.94547417e-02, 1.27696382e-01,
-5.83134662e-02, -1.77152700e-02, 1.04088088e-01,
-9.37464497e-02, -6.91969778e-02],
[ 8.06328039e-02, 3.34674281e-02, -8.56967180e-02,
-1.07828189e-01, 1.51742110e-01, -5.63728817e-02,
1.92857500e-02, -3.40115407e-02, -5.84289756e-02,
-6.68494643e-02, 2.75286207e-02, -6.91126145e-01,
6.71008607e-01, 4.13740967e-02, -2.71542091e-02,
7.31225166e-02, 3.64767385e-02],
[ 1.33405806e-01, -1.45497511e-01, 2.95896092e-02,
6.97722522e-01, -6.17274818e-01, 9.91640992e-03,
2.09515982e-02, 3.83544794e-02, 3.40197083e-03,
-9.43887925e-03, -3.09001353e-03, -1.12055599e-01,
1.58909651e-01, -2.08991284e-02, -8.41789410e-03,
-2.27742017e-01, -3.39433604e-03],
[ 4.59139498e-01, -5.18568789e-01, -4.04318439e-01,
-1.48738723e-01, 5.18683400e-02, 5.60363054e-01,
-5.27313042e-02, 1.01594830e-01, -2.59293381e-02,
2.88282896e-03, -1.28904022e-02, 2.98075465e-02,
-2.70759809e-02, -2.12476294e-02, 3.33406243e-03,
-4.38803230e-02, -5.00844705e-03],
[ 3.58970400e-01, -5.43427250e-01, 6.09651110e-01,
-1.44986329e-01, 8.03478445e-02, -4.14705279e-01,
9.01788964e-03, 5.08995918e-02, 1.14639620e-03,
7.72631963e-04, -1.11433396e-03, 1.38133366e-02,
6.20932749e-03, -2.22215182e-03, -1.91869743e-02,
-3.53098218e-02, -1.30710024e-02]])
# Extract the eigenvalues
pca1.explained_variance_
array([5.45052162, 4.48360686, 1.17466761, 1.00820573, 0.93423123,
0.84849117, 0.6057878 , 0.58787222, 0.53061262, 0.4043029 ,
0.31344588, 0.22061096, 0.16779415, 0.1439785 , 0.08802464,
0.03672545, 0.02302787])
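As a side note, `explained_variance_ratio_` (used below) is simply each eigenvalue divided by their sum when all components are kept. A minimal sketch on synthetic standardized data (not the college dataset) confirms this:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix used in this notebook
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(100, 5)))

pca = PCA()  # keep all components
pca.fit(X)

# Ratio computed by hand from the eigenvalues
manual_ratio = pca.explained_variance_ / pca.explained_variance_.sum()
assert np.allclose(manual_ratio, pca.explained_variance_ratio_)
```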
# Export the principal-component loadings (eigenvectors) into a DataFrame indexed by the original features
df_export = pd.DataFrame(pca1.components_.T,
columns = ['PC1','PC2', 'PC3', 'PC4', 'PC5', 'PC6',
'PC7','PC8', 'PC9', 'PC10', 'PC11', 'PC12', 'PC13', 'PC14', 'PC15', 'PC16', 'PC17'],
index = df_pca.columns)
df_export
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 | PC12 | PC13 | PC14 | PC15 | PC16 | PC17 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Apps | 0.248766 | 0.331598 | -0.063092 | 0.281311 | 0.005741 | -0.016237 | -0.042486 | -0.103090 | -0.090227 | 0.052510 | 0.043046 | 0.024071 | 0.595831 | 0.080633 | 0.133406 | 0.459139 | 0.358970 |
| Accept | 0.207602 | 0.372117 | -0.101249 | 0.267817 | 0.055786 | 0.007535 | -0.012950 | -0.056271 | -0.177865 | 0.041140 | -0.058406 | -0.145102 | 0.292642 | 0.033467 | -0.145498 | -0.518569 | -0.543427 |
| Enroll | 0.176304 | 0.403724 | -0.082986 | 0.161827 | -0.055694 | -0.042558 | -0.027693 | 0.058662 | -0.128561 | 0.034488 | -0.069399 | 0.011143 | -0.444638 | -0.085697 | 0.029590 | -0.404318 | 0.609651 |
| Top10perc | 0.354274 | -0.082412 | 0.035056 | -0.051547 | -0.395434 | -0.052693 | -0.161332 | -0.122678 | 0.341100 | 0.064026 | -0.008105 | 0.038554 | 0.001023 | -0.107828 | 0.697723 | -0.148739 | -0.144986 |
| Top25perc | 0.344001 | -0.044779 | -0.024148 | -0.109767 | -0.426534 | 0.033092 | -0.118486 | -0.102492 | 0.403712 | 0.014549 | -0.273128 | -0.089352 | 0.021884 | 0.151742 | -0.617275 | 0.051868 | 0.080348 |
| F.Undergrad | 0.154641 | 0.417674 | -0.061393 | 0.100412 | -0.043454 | -0.043454 | -0.025076 | 0.078890 | -0.059442 | 0.020847 | -0.081158 | 0.056177 | -0.523622 | -0.056373 | 0.009916 | 0.560363 | -0.414705 |
| P.Undergrad | 0.026443 | 0.315088 | 0.139682 | -0.158558 | 0.302385 | -0.191199 | 0.061042 | 0.570784 | 0.560673 | -0.223106 | 0.100693 | -0.063536 | 0.125998 | 0.019286 | 0.020952 | -0.052731 | 0.009018 |
| Outstate | 0.294736 | -0.249644 | 0.046599 | 0.131291 | 0.222532 | -0.030000 | 0.108529 | 0.009846 | -0.004573 | 0.186675 | 0.143221 | -0.823444 | -0.141856 | -0.034012 | 0.038354 | 0.101595 | 0.050900 |
| Room.Board | 0.249030 | -0.137809 | 0.148967 | 0.184996 | 0.560919 | 0.162755 | 0.209744 | -0.221453 | 0.275023 | 0.298324 | -0.359322 | 0.354560 | -0.069749 | -0.058429 | 0.003402 | -0.025929 | 0.001146 |
| Books | 0.064758 | 0.056342 | 0.677412 | 0.087089 | -0.127289 | 0.641055 | -0.149692 | 0.213293 | -0.133663 | -0.082029 | 0.031940 | -0.028159 | 0.011438 | -0.066849 | -0.009439 | 0.002883 | 0.000773 |
| Personal | -0.042529 | 0.219929 | 0.499721 | -0.230711 | -0.222311 | -0.331398 | 0.633790 | -0.232661 | -0.094469 | 0.136028 | -0.018578 | -0.039264 | 0.039455 | 0.027529 | -0.003090 | -0.012890 | -0.001114 |
| PhD | 0.318313 | 0.058311 | -0.127028 | -0.534725 | 0.140166 | 0.091256 | -0.001096 | -0.077040 | -0.185182 | -0.123452 | 0.040372 | 0.023222 | 0.127696 | -0.691126 | -0.112056 | 0.029808 | 0.013813 |
| Terminal | 0.317056 | 0.046429 | -0.066038 | -0.519443 | 0.204720 | 0.154928 | -0.028477 | -0.012161 | -0.254938 | -0.088578 | -0.058973 | 0.016485 | -0.058313 | 0.671009 | 0.158910 | -0.027076 | 0.006209 |
| S.F.Ratio | -0.176958 | 0.246665 | -0.289848 | -0.161189 | -0.079388 | 0.487046 | 0.219259 | -0.083605 | 0.274544 | 0.472045 | 0.445001 | -0.011026 | -0.017715 | 0.041374 | -0.020899 | -0.021248 | -0.002222 |
| perc.alumni | 0.205082 | -0.246595 | -0.146989 | 0.017314 | -0.216297 | -0.047340 | 0.243321 | 0.678524 | -0.255335 | 0.423000 | -0.130728 | 0.182661 | 0.104088 | -0.027154 | -0.008418 | 0.003334 | -0.019187 |
| Expend | 0.318909 | -0.131690 | 0.226744 | 0.079273 | 0.075958 | -0.298119 | -0.226584 | -0.054159 | -0.049139 | 0.132286 | 0.692089 | 0.325982 | -0.093746 | 0.073123 | -0.227742 | -0.043880 | -0.035310 |
| Grad.Rate | 0.252316 | -0.169241 | -0.208065 | 0.269129 | -0.109268 | 0.216163 | 0.559944 | -0.005336 | 0.041904 | -0.590271 | 0.219839 | 0.122107 | -0.069197 | 0.036477 | -0.003394 | -0.005008 | -0.013071 |
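The loadings table above is built by transposing `pca.components_` and labelling rows with the feature names. A small illustrative sketch (hypothetical feature names, synthetic data in place of `df_pca`):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
features = ['f1', 'f2', 'f3']  # hypothetical feature names
X = pd.DataFrame(StandardScaler().fit_transform(rng.normal(size=(60, 3))),
                 columns=features)

pca = PCA().fit(X)

# Rows = original features, columns = principal components
loadings = pd.DataFrame(pca.components_.T,
                        columns=[f'PC{i+1}' for i in range(pca.n_components_)],
                        index=features)
assert loadings.shape == (3, 3)
```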
# Extract the first PC
df_export1=df_export.T
df_export1.head(1).round(2)
| Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PC1 | 0.25 | 0.21 | 0.18 | 0.35 | 0.34 | 0.15 | 0.03 | 0.29 | 0.25 | 0.06 | -0.04 | 0.32 | 0.32 | -0.18 | 0.21 | 0.32 | 0.25 |
# Explicit form of the first PC
#PC=a11*X1+a12*X2+a13*X3+...+a1n*Xn
for i in range(0, df_export1.shape[0]):
print('{}*{}'.format(pca1.components_[0][i].round(2), df_export1.columns[i]), end='+')
0.25*Apps+0.21*Accept+0.18*Enroll+0.35*Top10perc+0.34*Top25perc+0.15*F.Undergrad+0.03*P.Undergrad+0.29*Outstate+0.25*Room.Board+0.06*Books+-0.04*Personal+0.32*PhD+0.32*Terminal+-0.18*S.F.Ratio+0.21*perc.alumni+0.32*Expend+0.25*Grad.Rate+
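The explicit form printed above is just a dot product: the PC1 score of an observation is the sum of its (standardized) feature values weighted by the PC1 loadings. A hedged sketch on synthetic data (standing in for `df_pca`) checks this against `fit_transform`:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = StandardScaler().fit_transform(rng.normal(size=(50, 4)))

pca = PCA()
scores = pca.fit_transform(X)

# PC1 score of the first observation, computed term by term
pc1_first = sum(pca.components_[0][i] * X[0, i] for i in range(X.shape[1]))
assert np.isclose(pc1_first, scores[0, 0])
```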
# Find the cumulative explained-variance ratio to decide the cut-off for the number of PCs
np.cumsum(pca1.explained_variance_ratio_)
array([0.32020628, 0.58360843, 0.65261759, 0.71184748, 0.76673154,
0.81657854, 0.85216726, 0.88670347, 0.91787581, 0.94162773,
0.96004199, 0.9730024 , 0.98285994, 0.99131837, 0.99648962,
0.99864716, 1. ])
#As per the cumulative ratio, the first 9 PCs capture ~91.8% of the variance, so we select 9 out of the 17 PCs on the basis of "cumulative explained variance"
#The cumulative eigenvalue share tells us how many PCs to keep for a given percentage of the variance retained
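This cut-off rule can be sketched as: keep the smallest number of PCs whose cumulative explained-variance ratio crosses a chosen threshold. The threshold and data below are illustrative, not from this notebook:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = StandardScaler().fit_transform(rng.normal(size=(200, 10)))

pca = PCA().fit(X)
cum_ratio = np.cumsum(pca.explained_variance_ratio_)

threshold = 0.90  # illustrative cut-off
# Index of the first entry at or above the threshold, +1 for a component count
n_keep = int(np.argmax(cum_ratio >= threshold)) + 1
assert cum_ratio[n_keep - 1] >= threshold
```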
df_selected = df_export[['PC1','PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7', 'PC8', 'PC9']]
df_selected
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | |
|---|---|---|---|---|---|---|---|---|---|
| Apps | 0.248766 | 0.331598 | -0.063092 | 0.281311 | 0.005741 | -0.016237 | -0.042486 | -0.103090 | -0.090227 |
| Accept | 0.207602 | 0.372117 | -0.101249 | 0.267817 | 0.055786 | 0.007535 | -0.012950 | -0.056271 | -0.177865 |
| Enroll | 0.176304 | 0.403724 | -0.082986 | 0.161827 | -0.055694 | -0.042558 | -0.027693 | 0.058662 | -0.128561 |
| Top10perc | 0.354274 | -0.082412 | 0.035056 | -0.051547 | -0.395434 | -0.052693 | -0.161332 | -0.122678 | 0.341100 |
| Top25perc | 0.344001 | -0.044779 | -0.024148 | -0.109767 | -0.426534 | 0.033092 | -0.118486 | -0.102492 | 0.403712 |
| F.Undergrad | 0.154641 | 0.417674 | -0.061393 | 0.100412 | -0.043454 | -0.043454 | -0.025076 | 0.078890 | -0.059442 |
| P.Undergrad | 0.026443 | 0.315088 | 0.139682 | -0.158558 | 0.302385 | -0.191199 | 0.061042 | 0.570784 | 0.560673 |
| Outstate | 0.294736 | -0.249644 | 0.046599 | 0.131291 | 0.222532 | -0.030000 | 0.108529 | 0.009846 | -0.004573 |
| Room.Board | 0.249030 | -0.137809 | 0.148967 | 0.184996 | 0.560919 | 0.162755 | 0.209744 | -0.221453 | 0.275023 |
| Books | 0.064758 | 0.056342 | 0.677412 | 0.087089 | -0.127289 | 0.641055 | -0.149692 | 0.213293 | -0.133663 |
| Personal | -0.042529 | 0.219929 | 0.499721 | -0.230711 | -0.222311 | -0.331398 | 0.633790 | -0.232661 | -0.094469 |
| PhD | 0.318313 | 0.058311 | -0.127028 | -0.534725 | 0.140166 | 0.091256 | -0.001096 | -0.077040 | -0.185182 |
| Terminal | 0.317056 | 0.046429 | -0.066038 | -0.519443 | 0.204720 | 0.154928 | -0.028477 | -0.012161 | -0.254938 |
| S.F.Ratio | -0.176958 | 0.246665 | -0.289848 | -0.161189 | -0.079388 | 0.487046 | 0.219259 | -0.083605 | 0.274544 |
| perc.alumni | 0.205082 | -0.246595 | -0.146989 | 0.017314 | -0.216297 | -0.047340 | 0.243321 | 0.678524 | -0.255335 |
| Expend | 0.318909 | -0.131690 | 0.226744 | 0.079273 | 0.075958 | -0.298119 | -0.226584 | -0.054159 | -0.049139 |
| Grad.Rate | 0.252316 | -0.169241 | -0.208065 | 0.269129 | -0.109268 | 0.216163 | 0.559944 | -0.005336 | 0.041904 |
# Re-fit PCA with the 9 selected components and extract their eigenvectors
pca2 = PCA(n_components=9, random_state=123)
pca2_transformed = pca2.fit_transform(df_pca)
pca2.components_
array([[ 0.2487656 , 0.2076015 , 0.17630359, 0.35427395, 0.34400128,
0.15464096, 0.0264425 , 0.29473642, 0.24903045, 0.06475752,
-0.04252854, 0.31831287, 0.31705602, -0.17695789, 0.20508237,
0.31890875, 0.25231565],
[ 0.33159823, 0.37211675, 0.40372425, -0.08241182, -0.04477866,
0.41767377, 0.31508783, -0.24964352, -0.13780888, 0.05634184,
0.21992922, 0.05831132, 0.04642945, 0.24666528, -0.24659527,
-0.13168986, -0.16924053],
[-0.0630921 , -0.10124906, -0.08298557, 0.03505553, -0.02414794,
-0.06139298, 0.13968172, 0.04659887, 0.14896739, 0.67741165,
0.49972112, -0.12702837, -0.06603755, -0.2898484 , -0.14698927,
0.22674398, -0.20806465],
[ 0.28131053, 0.26781735, 0.16182677, -0.05154725, -0.10976654,
0.10041234, -0.15855849, 0.13129136, 0.18499599, 0.08708922,
-0.23071057, -0.53472483, -0.51944302, -0.16118949, 0.01731422,
0.07927349, 0.26912907],
[ 0.00574141, 0.05578609, -0.05569364, -0.39543434, -0.42653359,
-0.04345437, 0.30238541, 0.222532 , 0.56091947, -0.12728883,
-0.22231102, 0.14016633, 0.20471973, -0.07938825, -0.21629741,
0.07595812, -0.10926791],
[-0.01623744, 0.00753468, -0.04255798, -0.0526928 , 0.03309159,
-0.04345423, -0.19119858, -0.03000039, 0.16275545, 0.64105495,
-0.331398 , 0.09125552, 0.15492765, 0.48704587, -0.04734001,
-0.29811862, 0.21616331],
[-0.04248635, -0.01294972, -0.02769289, -0.16133207, -0.11848556,
-0.02507636, 0.06104235, 0.10852897, 0.20974423, -0.14969203,
0.63379006, -0.00109641, -0.02847701, 0.21925936, 0.24332116,
-0.22658448, 0.55994394],
[-0.1030904 , -0.05627096, 0.05866236, -0.12267803, -0.10249197,
0.07888964, 0.57078382, 0.009846 , -0.22145344, 0.21329301,
-0.23266084, -0.07704 , -0.01216133, -0.08360487, 0.67852365,
-0.05415938, -0.00533554],
[-0.09022708, -0.17786481, -0.12856071, 0.34109986, 0.40371199,
-0.05944192, 0.5606729 , -0.00457333, 0.27502255, -0.13366335,
-0.09446889, -0.18518152, -0.2549382 , 0.27454438, -0.25533491,
-0.04913888, 0.04190431]])
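Re-fitting with a smaller `n_components` (as done with `pca2` above) should reproduce the leading components of the full fit. A sketch on synthetic data standing in for `df_pca`, comparing absolute values since component signs are only defined up to a flip:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(123)
X = StandardScaler().fit_transform(rng.normal(size=(120, 6)))

full = PCA().fit(X)
truncated = PCA(n_components=3).fit(X)

# The truncated fit keeps just the first 3 eigenvectors of the full fit
assert truncated.components_.shape == (3, 6)
assert np.allclose(np.abs(truncated.components_),
                   np.abs(full.components_[:3]))
```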
The eigenvector array now contains only the 9 retained components. These 9 PCs capture ~92% of the information in the data.
The business implication of using PCA in this study is dimensionality reduction: it improves interpretability while minimizing information loss. Based on the cumulative explained variance, 9 of the 17 PCs retain ~92% of the variance, so we have cut the dimensionality roughly in half at the cost of only ~8% of the information.